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1 Monte Carlo Simulation of the Patterns of Selection and 
Diversity 

We introduce a Monte Carlo model aiming to regenerate the patterns of selection and diver- 
sity shown in Figure 4 in the main text. In this model, the sequence of the HAl domain con- 
tains a dominant epitope bound by the antibody and possessing a high evolutionary rate /ii = 
0.12 amino acid substitution/site/season, with the other a mino acid positions wit h a low evolution- 
ary rate yLt2 = 0.0034 amino acid substitution/site/season ( Ferguson et al. 2003f) . In each season. 



the numbers of positions in and out of the dominant epitope were defined as Li and L2, respectively. 
An ensemble of 1000 HAl sequences was created with identical amino acid identity in each of the 
329 positions, and was simulated from the 1969-1970 season to the 2009-2010 season. The histor- 
ical dominant epitopes of H3 hemagglutinin were epitopes A and B, each of which was dominant 
for about seven seasons (jGupta et al. l2006h . Therefore, the dominant epitope in the model was 



initialized as epitope A, and shifted between epitopes A and B every seven seasons. In each season, 
the numbers of amino acid substitutions in and out of the dominant epitope were randomly deter- 
mined from two Poisson distributions with mean values Ai = fJ-iLi and A2 = /i2i2, respectively. 
The positions of amino acid substitutions and the new amino acid identities were then randomly 
assigned. This process was repeated for all the 1000 sequences in each season. Following this proce- 
dure, we calculated for each position j the average relative entropy Sj, the number of seasons under 
selection Nj, and the average diversity Dj, using the simulated sequences between the 1993-1994 
season and the 2009-2010 season. We also present the histograms of Sj, Nj, and Dj. The results 
of these calculations are shown in Figure ISII 

The general picture depicted by this Monte Carlo simulation model reflects natural influenza 
evolution. The similarity between the results from the historical sequences in Figure 4 and those 
from the Monte Carlo simulation model in Figure [ST] suggests that the Monte Carlo simulation 
model captures a major part of the picture of influenza evolution. The Monte Carlo simulation 
model also suggests that the binding of antibody and the increased substitution rate in the dominant 
epitope are the significant features of influenza evolution. 

The annual evolution in the dominant epitope bound by the antibody decreases the affinity 
between virus and antibody and enables the virus to escape the immune memory of the virus 
circulating in the previous seasons. Th e evolutionary rate in the dominant epitope is fii — 
0.12 amino acid s ubstitution /site /s eason ( Ferguson et al. 2003f l. Both the historical sequences of 



the HAl domain (jShih et aLll2007f l and the Monte Carlo simulation model suggest that substitu- 
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tions of amino acid identities have occurred randomly across the dominant epitope. The relative 
frequency / {k,i,j) of each amino acid k in each position j in each season i shows that the amino 
acid substitutions were in random posit ions in each season and displayed few visible correlation 



between two positions ( Shih et al. 2007|). The Monte Carlo simulation model randomly selects 



the positions in which the amino acid is mutated, and generates similar patterns of selection and 
diversity to the historical data. 
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Figure SI: The results of the Monte Carlo simulation model containing epitopes A (blue) and B 
(red), and all the other positions. The model was simulated for 41 seasons, (a) Average selection in 



each position quantified by relative entropy calculated by Sj — J2i=i the last 17 seasons. 

The colors represent positions in epitopes A and B and positions outside the epitopes, (b) Number 
of seasons for each position when the relative entropy was greater than the threshold S'f^™^, i.e. the 
position was under selection, (c) Average diversity in each position quantified by Shannon entropy 
in the 17 seasons, calculated by Dj = X^Hi Dij /17. |(d)| Distribution of the average selection in 



each position displayed in (a) (e) Distribution of the numbers of seasons under selection displayed 



in |(b)[ 1(f) I Distribution of the average diversity in each position shown in (c' 
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Abstract 

Many viruses evolve rapidly. For example, hemagglutinin of the H3N2 influenza A virus 
evolves to escape antibody binding. This evolution of the H3N2 virus means that people 
who have previously been exposed to an influenza strain may be infected by a newly emerged 
virus. In this paper, we use Shannon entropy and relative entropy to measure the diversity 
and selection pressure by antibody in each amino acid site of H3 hemagglutinin between the 
1992-1993 season and the 2009-2010 season. Shannon entropy and relative entropy are two 
independent state variables that we use to characterize H3N2 evolution. The entropy method 
estimates future H3N2 evolution and migration using currently available H3 hemagglutinin 
sequences. First, we show that the rate of evolution increases with the virus diversity in the 
current season. The Shannon entropy of the sequence in the current season predicts relative 
entropy between sequences in the current season and those in the next season. Second, a global 
migration pattern of H3N2 is assembled by comparing the relative entropy flows of sequences 
sampled in China, Japan, the USA, and Europe. We verify this entropy method by describing 
two aspects of historical H3N2 evolution. First, we identify 54 amino acid sites in hemagglutinin 
that have evolved in the past to evade the immune system. Second, the entropy method shows 
that epitopes A and B in the top of hemagglutinin evolve most vigorously to escape antibody 
binding. Our work provides a novel entropy-based method to predict and quantify future H3N2 
evolution and to describe the evolutionary history of H3N2. 
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1 Introduction 



A common strategy by which viruses evade pressure from the immune system is to evolve and 
change their antigenic p rofile. Viruses with a low evolutionary rate that infect only humans, such 

37I). can be effectively controlled by vaccinating the human 
viruses with a high evolutionary rate, such as HIV, hepatitis B, and 



as the small pox virus ()Li et al. 
population. By contrast 



influenza A, resist being eliminated by the immune system by generating a plethora of mutated 
virus particles and causing chronic or repeated infection. In this study, we take subtype II3N2 
influenza A virus as a model evolving virus. Influenza A virus circulates in the human population 
every year, typica lly causing 3-5 million severe illnesses and 250,000-500,000 fatalities all over 
the world (jWHOl ) . Hemagglutinin (HA) and neuraminidase (NA) are two kinds of virus surface 
glycoproteins encoded by the influenza genome. The subtype of influenza is jointly determined by 
the type of hemagglutinin ranging from HI to H16, and that of neuraminidase ranging from Nl 
to N9. On the surface of the virus membrane, HA exists as a cylindrical trimer containing three 
HA monomers, and each monomer comprises two domains, HAl and HA2. Hemagglutinin is also a 
key factor in virus evolution, because it is the major target of antibodies, and HA escape mutation 
changes the antigenic character of the virus presented to the immune system. The H3N2 virus 
causes the largest fraction of influenza illness. H3 hemagglutin in is under selectio n by the immune 
response mainly on the flve epitope regions in the HAl domain (jWilev et aLlll98iri , labeled epitopes 
A to E, as shown in Figure [TJ The immune pressure and the escape mutation drive the evolution of 
the H3N2 virus. The und erlying mutation rate of the HA gene is 1.6x 10~^/amino acid position/day 
( Nobusawa fc Satol 12006), measured using the method modifled from that in an earlier study on 
the HA mutation rate ( Parvin et aLlll986l ). Note that the mutation rate does not necessarily equal 
to the evolution rate, or the fixation rate. The mutation rate equals to the evolutionary rate only 
if the evolution is neutral. The non-neutrality of the HA evolution is shown in the Results section. 
Evolution of the hemagglutinin viral protein causes occasional mis match bet ween the virus and the 
vaccine and decreases vaccine effectiveness (Deem fc Lee, 2003. : .Gupta et a/.l[20063 . As more amino 
acid substitutions a.re introduced iiit o infiuenza sequences, the antigenic characteristics of influenza 



strains drift away (ILiao et aLl 120081) . and i nfluenza epidemic severity of subtype HlNl (|Wu et 



[2OIQI and subtvpe H3N2 (jWolf oUbOlOt ) increases 

The H3N2 virus has a distinguished evolutionary history, largely affected by the immune pres- 
sure. The H3N2 virus emerged in the human population in 1968 and has been circulating in the 
population since 1968. The phylogenetic tree of H3 hemagglutinin since 1968 has a linear topology 
in which most sequences are close to the single trunk of the tree, and the lengths of the branches 
are short ([Ferguson et "aLl l2003l: iRusseh et al\ I2OO8I : ISmith et al\ 120041) . Historical hemagglutinin 
sequences fall into a series of clusters, each of which has similar genetic or antigenic features and 
circulates for 2-8 years before being replaced by the next cluster (jPlotkin et a/.l 120021 : ISmith et al. 
20041) . The evolution of different amino acid positions of hemagglutinin shows a remarkable het- 
erogeneity: a su bset of positions undergo frequent change, while some positions are conserved 



(|Shih et al\ 120071) . This heterogeneity is q uantified by the Sh annon entropy at each position of 



the amino acid sequence of hemagglutinin ( Deem fc Pan 2009[) . Shannon entropy has been used 
to locate protein reg ions with high diversity, such as the antigen binding sites of T-cell receptors 
( Stewart et "Zl ll997[ ). Shannon entropy has been used to identify antibody binding sit es, or epitopes, 
which are under immune pressure and so are rapidly evolving ( Deem fc Paul [20091 ). The hetero- 
geneity of amino acid substitution suggests that point mutations randomly occurring in distinct 
positions have different contributions to the virus fitness. 
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Figure 1: The tertiary structure of the HAl domain of H3 hemagglutinin (PDB code: IHGF). The 
surface of HAl facing outward is the exposed surface when the hemagglutinin trimer is formed. 
The other two HAl domains (not shown) in the HA trimer are located at the back of the structure 
displayed here. The solid balls represent five epitopes. Color code: blue is epitope A, red is epitope 
B, cyan is epitope C, yellow is epitope D, and green is epitope E. 
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The selection pressure on the H3N2 virus to evolve is reflected in the difference between the 
H3 hemagglutinin sequences in two consecutive seasons. We consider Northern hemisphere strains. 
When the epidemic initiates in a new season, we assume that each position of an HA sequence 
inherits the amino acid from a sequence of the previous season or has a different amino acid due 
to random mutation and selection. This assumption comes from the fact that the H3N2 virus 
circulating in each influenza season migr ates from a certa in geogra phic region in w hich the virus is 
preserved between two influenza seasons ([Rambaut et al ]|2008; Ru ssell et al\\200^ . In the absence 
of selection, the histogram of the 20 amino acid usage in one position in the current season is similar 
to that in the same position in the previous season except for changes due to the small random 
mutation rate. The difference between the two histograms beyond that expected due to mutation 
quantifies selection. 

Synthesizing these factors, we introduce an entropy method to describe the evolution of in- 
fluenza. The entropy method extracts an evolutionary pattern from aligned sequences. Shan- 
non entropy quantifies the amount of sequence info rmation in each position of aligned sequences 
( Schneider fc Stephens! [l990l : ISchneider et al\ The sequence information refiects the vari- 

ation, which is equivalently diversity, in ea ch position, and so Shannon entropy has been used 
to measure the div ersity in each position ( Gerstein fc Altman IQQSt Sander fc Schneider 1991 
Shenkin et al. Il99l[ ). Shannon entr opy has also been used to measure the s tructural co nserva- 



tion in the protein folding dynamics ( Mirnv fc Shakhnovich 19991 : Plaxco et al. IlipOO). See (|Valdaij 
2002h for a detailed review of the applications of Shannon entropy. On the other hand, relative 



entropy measures gain of se quence informatio n at each position and requires a background amino 
acid frequency distribution (| WilliamsonI 1 1 9951 ) . Rel ative entropy was also used as a sequence con - 
servation measure to detect functional protein sites ( Halabi et al. 2009t Wang fc Samudrala 20061 ). 
Further, a dimension reduction technique using re lative entropy has identified sectors in proteins 
( Halabi et al. 20091 : Lockless fc Ranganathanl 1999h . As an extension of these previous works, we 
apply Shannon entropy and relative entropy to jointly measure two quantities in each position: 
sequence information in one season and gain of sequence information from one season to the next 
season. Simultaneous analysis of Shannon entropy and relative entropy sheds light on the evolu- 
tionary pattern of the H3N2 virus evolution when data from multiple seasons are available. In the 
HAl domain, positions in the epitope regions have in creased Shannon en tropy, and this feature 
was applied to locate the epitopes of HI hemagglutinin ( Deem fc Pan 20091 ). We here use Shannon 
entropy to quantify the virus d iversity in each amino ac id position in each season. The entropy 
relative to the previous season (iKullback fc Leibleijll95ll) is also used to analyze the evolution of 
the HAl domain in one single season and to quantify the selection pressure on the virus in each 
amino acid position in each season. The selection and the virus diversity are two significant state 
variables determining the dynamics of evolution. 

The article is organized as follows. The Materials and Methods section presents the data used 
in this work and details of the entropy model. The Results section uses Shannon entropy of the 
sequence in one season to predict the evolution quantified by relative entropy from this season to the 
next season. Results are also presented for the flow of virus migration between the four geographic 
regions of China, Japan, the USA, and Europe. In the Result section, we demonstrate the entropy 
method in two applications, the results of which agree with prior knowledge on H3N2 evolution. 
Finally, we discuss our results and present our conclusions. 
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2 Materials and Methods 



2.1 Sequence Data 

The hemagglutinin sequences considered in this work are the amino acid sequences of the hemag- 
glutinin of human influenza A H3N2 virus. We only use Northern Hemisphere sequences because 
90% of the world population lives in Northern Hemisphere. The influenza season in Northern 
Hemisphere is defined as the time interval from October in one year to September in the next 
year. We downloaded 5309 Northern Hemisphere sequences labeled with month of collection from 
the NCBI Influenza Virus Resource on 16 January 2011. Sequences too short to cover residues 
1-329 of hemagglutinin were removed, and the remaining sequences were trimmed to only keep 
residues 1-329 in the HAl domain. Any sequence with an undefined amino acid in residues 1-329 
was removed. We consider 18 seasons from 1992-1993 to 2009-2010 during which most available 
sequences were collected. In total, we preserved and aligned 4292 sequences in these 18 seasons 
containing amino acids 1-329. 



2.2 Histograms of 20 Amino Acids 

The first step is to quantify the alignment of the amino acid sequences. The aligned historical 
H3 hemagglutinin sequences form a matrix A with 4292 rows and 329 columns. The element A/ j 
denotes the identity of the amino acid in sequence I and position j. The 4292 sequences were 
clustered into 18 groups by the seasons of sampling from the 1992-1993 season to the 2009-2010 
season. Note that most of the sequences before the 1992-1993 season were not labeled with month 
of collection and are excluded from this study. We denote by i = 0, 1, . . . , 17 the seasons between 
1992-1993 and 2009-2010. For position j in season i, the relative frequency of each amino acid k, 
f {k,i,j) ,k — 1, . . . , 20, was calculated from the vector A^-^^^ ^ where the index array I (z) holds the 
indices of sequences sampled in season i. 



2.3 Shannon Entropy as Diversity 



The Shannon entropy is one useful quantification of the diversity in single position. Large Shannon 
entropy has the physical meaning that the a mino acid in the give n position is prone to be substituted. 
This physical meaning was also applied in ( Deem fc Pan 20091 ). The diversity at a single position 
takes the format of Shannon entropy because of the sensitivity of Shannon entropy to diversity. 

This physical meaning of the Shannon entropy does not necessarily involve the joint frequency 
distributions for two and more positions, and we do not consider the joint frequency in the 
manuscript. Rather, we define the diversity only in the level of single amino acid position. Conse- 
quently, the defined diversity is additive for a number of positi ons. The idea of addm g diversity in 
each position of the sequence comes from classic works such as ( Schneider et al. 1986h . which added 
the Shannon entropy in each position to measure the total diversity in an aligned binding site. 

Therefore, diversity of the virus in each position in each season is represented by the Shannon 
entropy that quantifies the amount of information in the histogram or distribution under study. 
For the sequences sampled in all the seasons, positions w ith high evolutionar y rate have a higher 
Shannon entropy compared to the conserved positions ( Deem &: Pan 2009f ). The sequences in 
each season are assumed to be collected concurrently. The Shannon entropy is a quantification of 
diversity of amino acids in one position, and so the diversity in position j in season i is calculated 
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from the histogram f — [/ (1, i, j) , . . . , / (20, i, j)]^ by Shannon entropy 



20 

fe=i 

in which fc = 1, . . . , 20 is the identity of the amino acid in position j in season i. 
2.4 Relative Entropy as Selection Pressure 

Selection in each position j in season i is reflected by the significant difference between the 20-bin 
histogram in the current season f (i, and that in the previous season f (i — 1, j)- In the absence 
of selection, random mutation and genetic drift are the dominant forces generating f (i, j) from 
f (i — 1, j). In each position, random mutation creates a slightly modified histogram, from which 
amino acids are randomly chosen to appear in season i by the effect of genetic drift. 

The source of random mutation is the spontaneous error of the RNA polymerase replicating the 
influenza virus RNA. The random mutation rate in different regions of hem agglutinin is thought t o 
be homogeneous, regardless whether the regions are in antigenic sites or not (jlna fc Goiobori|[l99l . 



Therefore random mutation is modeled as a Poisson process M {fj,,g (fc)) equally affecting all the 
positions. Here u, is the m utation rate of influenza A virus that equals to 5.8 x 10~'^/residue/season 
( Nobusawa fc Satol 20061 ). and g{k) ,k = 1, . . . , 20 is the relative frequency of each amino acid in 



the whole alignment A. The probability that the original amino acid t mutates to amino acid u is 

So after mutating for one season, the histogram in position j in season i — 1 is obtained by 

i{i,j)^M{^i,g){{t~l,j). (3) 

This histogram serves as the background distribution for season i from which the sequences in 
season i are built. 

If selection is absent, the effect of genetic drift is to create sequences in the current season by 
randomly choosing amino acids in each position from a background distribution f (i, j). We denote 
by Ni the number of se quence in season i. The probability that A'^; amino acids in position j have 



Dy l\i tne number ot se quence m season i. 
the histogram i is (jHalabi et al. 2009f ) 



Pr{f(z,j)}«exp(-A^,5,,,) (4) 

where 

5,,=f;/(fc,z,,)log(j^ (5) 
fc=i J[k,i,j) 

is the relative entropy between the observed histogram, / (fc, and the background histogram, 



f{k,i,j) (iKullback fc Leibleilll95lD 



The null hypothesis that selection is absent in the evolution is rejected if the relative entropy 
Si,j is great enough such that the probability in Equation |4] is less than 0.05, that is, the relative 
entropy Sij is greater than — log (0.05) /Ni in season i. Note that the majority of residues were 
stable in most of the seasons, and in this case the relative entropy is Si,j = log (1/ (1 — fi)) ~ /i. To 
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avoid classifying these stable residues erroneously as positions under selection, a proper threshold 

of relative entropy needs to be larger than the mutation rate /i. Additionally, a fraction A of the 
circulating HAl sequences were not deposited in the database because of the sampling bias of the 
HAl sequences. In an extreme case, in a stable position j with the real histogram of 20 amino acids 

[1 — A, A, . . . , 0] in all the seasons, the histograms of the sequences sampled in two consecutive; 

T T 

seasons i — 1 and i are i {i — 1, j) = [1, 0, . . . , 0] and f = [1 — A, A, . . . , 0] , respectively, and 
so the relative entropy introduced by the sampling bias is 

5-«(l-A)logl^ + Alog-^ (6) 

in spite of the absence of selection in position j in season i. The relative entropy S^^'^ equals to 
0.1 if a samphng bias A = 2.5% exists in the HAl database sequences. We fix the threshold of the 
relative entropy in season i to 

5f ■■''^ = max I - log (0.05) /Ni, (1 - A) log + A log -A_ | « max {3/Nu 0.1} . (7) 

The numbers of collected HAl sequences iVj were fewer than 30 only in the 1995-1996 season {i = 3) 
with N3 = 25. The thresholds Sf""^^ = 0.12 > 0.1 in the 1995-1996 season due to the small numbers 
of HAl sequences. In all the other 17 seasons, the numbers of sequences Ni were greater than 30, 
and so the thresholds = 0.1. 



3 Results 

In this section, we show the positive correlation between the Shannon entropy in season i and the 
relative entropy from season i to season i + 1. This correlation means that the larger the virus 
diversity in one season, the higher the virus evolutionary rate from this season to the next season. 
We draw the H3N2 migration pattern by comparing the relative entropy. The migration pattern 
reveals a novel migration path from the USA to Europe and shows that the virus evolutionary rate 
is higher in the epicenter, China, than in the migration paths. We also demonstrate the entropy 
method in two applications. First, we compute average Shannon entropy and relative entropy in 
each position over the past 17 seasons to identify positions under selection pressure. Second, we 
compare Shannon entropy and relative entropy in epitope regions to find the contribution of each 
epitope to the H3N2 evolution. Results of these two applications agree with previous studies and 
additionally show the heterogeneity of the selection pressure over different amino acid positions of 
hemagglutinin, with increased pressure in the epitopes, as well as the dominance of epitopes A and 
B. 



3.1 Correlation between Shannon entropy and relative entropy 

Relative entropy Si+ij in amino acid position j from season i to season i + 1 linearly increases 

with Shannon entropy Di j in position j in season i. For the sequences sampled from the 1992-1993 
season {i = 0) to the 2008-2009 season (i = 16), 329 x 17 = 5593 ordered pairs {Dij,Si+ij) are 
calculated. All except two pairs fall into 8 bins in which the values of Dij belong to 8 intervals 
[0, 0.1), [0.1, 0.2), . . . , [0.7, 0.8), respectively. The first bin with Di j in [0, 0.1) is discarded because 
it contains numerous conserved amino acid positions. The values of Si+ij are averaged respectively 
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in each of the 7 remaining bins. As described in Figure[5J average relative entropy in each bin shows 
positive correlation with midpoints of the Dij interval, — 0.70. An amino acid position j with 
high Shannon entropy Dij in season i is expected to present high relative entropy Si+ij from 
season j to i + 1 . The evolution in position j from season i to i + 1 quantified by relative entropy is 
therefore predicted using the mean and standard error of Si+i,j in the bin chosen by Dij in season 
i. 

Positive correlation is also observed between the mean values of Dij and Si+ij in a variety of 
positions j in each season i. In each season between 1992-1993 {i — 0) and 2008-2009 {i — 16), we 
average the Shannon entropy j and relative entropy S'i+i,j over the positions j with Dij > 0.1. 
The data point with i = 1, (0.22, 1.38), has a large standard error of the relative entropy S'i+i,j and 
is excluded in the analysis below. The remaining average Shannon entropy (i = 0, 2, . . . , 16) 
correlates with average relative entropy (S')i+i {i — 0,2, ... , 16) with R"^ = 0.50, as shown in Figure 
[3l A least squares fit gives {S)i+i = 1.82{D)i — 0.23. Thus, the expected average relative entropy 
from the current season i to the next season i + 1 can be calculated from the average Shannon 
entropy {D)i in the current season i. 

The above relationships between Shannon entropy and relative entropy cannot be generated by 
a neutral evolution model. To demonstrate this result, we create an ensemble of 1000 identical 
sequences with 50 amino acid positions. Each iteration of the model simulates H3N2 evolution 
during one season. In each iteration, the number of mutated amino acids A^mut in each sequence 
follows a Poisson distribution with mean /i — 2.0, which is the annual substitution rate in history. 
The A'fnut mutated positions are then randomly assigned in the corresponding sequence. We ran- 
domly select pcut = 10% of the sequences to build the sequence ensemble in the next iteration. The 
Shannon entropy Dij and relative entropy Si+ij generated in iteration i — 51-100 are processed 
using the same method as for 113 sequences in history. First, as shown in Figure [21 no increasing 
trend appears in the means of Si+ij in the 7 bins from [0.1, 0.2) to [0.7, 0.8). Second, no correlation 
— 0.003) is observed between {D)i and (5)^+1, see Figure [H When we change the parameters 
/i between 1.0 and 10 and Pcut between 1% and 100% in the algorithm, the simulation still does 
not yield the visible increasing trend in Figure [2] or the correlation observed in Figure [3l As a re- 
sult, we conclude that neutral evolution alone is not able to generate the pattern between Shannon 
entropy and relation entropy of 113 sequences. It was previousl y shown that the fixation rate of 



II3N2 evolution cannot be explained only by neutral evolution (jShih et a/.l 120071 ). In this study, 
the monotonically increasing linear relationship between relative entropy and Shannon entropy in 
Figure [3 and [3] suggests that selection pressure substantially contributes to H3N2 evolution. 

3.2 Annual Virus Migration 

The entropy method is also used to analyze the global migration pattern of the virus. Most of the 
Northern Hemisphere H3 sequences were collected in East-Southeast Asia, the USA, and Europe. 



East- Southeast Asia is suggested to be the reservoir of the annual II3N2 epidemic ijRussell et al 



2008h . To increase the geographic resolution in East-Southeast Asia, we use two regions, China and 



Japan, as the representative of East-Southeast Asia, because each of these regions has a population 
over 50 million and has a consistent time series of 113 sequence data from the 2001-2002 to the 
2007-2008 season. 

We select the sequences in four geographic regions that are China, Japan, the USA, and Europe 
in seven seasons from 2001-2002 to 2007-2008. In all the six pairs of consecutive seasons, we 
calculate for each region four average relative entropy values of the whole sequence. These four 
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Figure 2: Mean and standard error of relative entropy S'i+ij in each bin of Shannon entropy. 
Shannon entropy and relative entropy in each of the 329 positions and in each of the 17 seasons 
between 1992-1993 {i = 0) and 2008-2009 {i = 16) fall into one of the eight bins. The first bin with 
Shannon entropy less than 0.1 is discarded. Bins with larger Shannon entropy Di j also have larger 
relative entropy (Sj+ij-. Shannon entropy Dij and relative entropy in iteration i = 51-100 

of the neutral evolution model are used to calculate mean and standard error of relative entropy in 
each bin of Shannon entropy distribution in the same way. No increasing trend is found. Error bar 
is one standard error. 
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Figure 3: Average Shannon entropy {D)i versus average relative entropy (5)^+1 for eaeh season 
between 1992-1993 (i = 0) and 2008-2009 (i = 16). For each season i, a set of amino acid positions 
j with Shannon entropy Dij greater than 0.1 are chosen. For all the j in this set of positions, 
(D)i is the average of the Shannon entropy Di j values and (S)i^i is the average of relative entropy 
Si-^-lJ values. Horizontal and vertical error bars are the standard errors of Shannon entropy and 
relative entropy, respectively. The solid line, (-S')i+i = 1.82(D)j — 0.23, is a least squares fit of {D)i 
to {S)i+i (i = 0; 2, ... , 16). A strong correlation with = 0.50 exists between {D)i and 
excluding the point (0.22, 1.38) with iVj = 1, which has a large standard error of the relative entropy 
Si+ij. Using the same method, {D)i and (<S')i+i are calculated from a neutral evolution model, 
i = 51-100, and plotted. No visible correlation exists between {D)i and (<S')i+i from the neutral 
evolution model. 
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Table 1: The relative entropy between hemagglutinin sequences in the different regions in the 
current influenza season and sequences in these regions in the previous season. The minimum 
relative entropy in each column is marked in bold. The p values of the Wilcoxon signed-rank test 
between the minimum relative entropy and other relative entropy values in the same column are 
in the parentheses. Hemagglutinin sequences were collected from four geographic regions: China, 
Japan, the USA, and Europe. Seven seasons from 2001-2002 to 2007-2008 are used here. The 
relative entropy values listed in this table are averaged for all the sites and all the six pairs of 
consecutive seasons. These results imply that the H3N2 viruses in China, Japan, and the USA 
migrate from China, while the H3N2 virus in Europe migrates from USA. 





Region of the 










current season China 


Japan 


USA 


Europe 


Region of the 










previous season 










China 


0.057 


0.040 


0.044 


0.064 
(0.0017) 


Japan 


0.114 


0.094 


0.076 


0.059 




(2.1 X 10-5) 


(0.0049) 


(0.0012) 


(0.032) 


USA 


0.105 


0.087 


0.070 


0.056 




(2.1 X 10-1°) 


(3.3 X 10-^) 


(6.4 X 10-5) 




Europe 


0.135 


0.115 


0.094 


0.074 




(3.8 X 10-^) 


(3.4 X 10-^) 


(4.8 X 10-^) 


(0.15) 



average relative entropy values are calculated using the sequences in each of the four regions in 
the previous season as the reference. The results are shown in Table [T] Sequences collected in 
China in the previous season yield the minimum relative entropy to the sequences in the current 
season collected in China {p < 2.1 x 10"^^ Wilcoxon signed-rank test), Japan {p < 0.0049, Wilcoxon 
signed-rank test), and the USA {p < 0.0012, Wilcoxon signed-rank test). Sequences in the USA in 
the previous season have the minimum relative entropy to the sequences in Europe in the current 
season {p < 0.15, Wilcoxon signed-rank test). Relative entropy data in Table [T] imply the virus 
migration from China, as the geographic reservoir, to Japan and the USA and suggest a migration 
from the USA to Europe. The result in Table [T] also implies that the H3N2 virus circulating in 
China seed the virus in China, Japan, and the USA in the next season, and the virus in the USA 
probably seed the virus in Europe in the next season. 

Comparison of the relative entropy data in Table [T] in the H3N2 reservoir and migration paths 
also reveals the H3N2 virus migration pattern. Using the Wilcoxon sign-rank test, the relative 
entropy data of the H3 hemagglutinin in China in two consecutive seasons is significantly greater 
than those in the three migration paths: from China to Japan {p = 0.035), from China to the USA 
{p = 0.0030), and from the USA to Europe {p = 0.0017). The relative entropy data, therefore, 
confirms China as the H3N2 reservoir and implies that novel H3N2 viruses are emerging in China, 
not during the migration process. 
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3.3 Positions under Selection 



The values of diversity as Shannon entropy Di j and selection as relative entropy Si,j are available 
for the sequences collected from the 1993-1994 season to the 2009-2010 season. First we apply the 
mean field approximation to remove the variation of selection and diversity over the time, and only 
consider the variation of Shannon entropy and relative entropy in different positions and regions 
over the past 17 seasons. A profile of the pattern of Shannon entropy and relative entropy in 
position j comprises the average selection, the number of seasons under selection, and the average 
diversity. The average selection Sj is expressed by the mean of relative entropy in each position 
over the 17 seasons 

1 

1=1 

and is displayed in Figure [^ |(a)[ The number of seasons under selection in each position j is 
calculated by 

17 

iV,=^i/(5.,,-5^-) (9) 



where H is the Heaviside step function. The numbers are shown in Figure |3] (b) The average 
diversity Dj in each position is calculated by averaging the Shannon entropy over the 17 seasons 
from 1993-1994 to 2009-2010 

1 " 



■t=i 



and is displayed in Figure 



Figure 13] (d) presents the distribution of the selection. Around 76% of the amino acid positions 
1-329 of hemagglutinin have an average selection close to zero and fall into the leftmost bin. The 
numbers of seasons when selection Sij in these positions were greater than the threshold level 5'*'^''°'' 



are shown in Figure |4] (e) The average diversities Dij in all the positions are shown in Figure |4] 



(f)[ If position j is under selection with Si,j > Sf^'^'^^ in greater than two of the 17 seasons between 
1993-1994 and 2009-2010, or Nj > 2, this position j is classified as a position under selection in 
the evolutionary history of H3N2 virus. The 54 positions with Sij > Sf^''°^ in greater than two 
seasons are listed in Table [2 

Patterns of selection and diversity similar to those observed in historical sequences in Figure 
H]are generated by a Monte Carlo simulation model, as displayed in Figure SI. The basis of the 
Monte Carlo simulation is that a ntibody binds to one of two epitope regions on the surface of the 
HAl domain (jCupta et a/.ll2006l ). and the dominant epit ope bound by antibody is under immune 



pressure and undergoes a higher substitution rate (Ferguson et al. 2003f) . The detailed description 



and discussion of the Monte Carlo model is in the Appendix. 



3.4 Comparison of Different Regions 



A human antibody binds to five epitopes in the H3 hemagglutinin (jWilev et aLl 119811 ). The five 
epitopes are located in different parts of the HAl domain of the cylinder-like structure of the H3 
hemagglutinin. Epitopes A and B are on the top of the HAl structure and are exposed in the 
HA trimer. Epitope D is on the top of HAl structure and is partly buried inside the HA trimer. 
Epitopes C and E are at the central area of the exposed surface of the HAl domain as shown in 
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Figure 4: 



(a) Average selection in each position quantified by relative entropy during the past 

Number 



17 seasons from 1993-1994 to 2009-2010, calculated by Sj = Y^iUSij/lT- The colors represent 
positions in epitopes A to E and positions outside the epitopes, as in Figure [1] 



(b) 



seasons for each position when the relative entropy was greater than the threshold S 



of 
the 



position was under selection, (c) Average diversity in each position quantified by Shannon entropy 
in the seasons from 1993-1994 to 2009-2010, calculated by Dj = Yl^J^i Dij/17. [(d)] Distribution of 



the average selection in each position displayed in (a) (e) Distribution of the numbers of seasons 
under selection displayed in |(b)[ |(f)| Distribution of the average diversity in each position shown in 

M 
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Table 2: Amino acid positions j under selection. To be included, the positions must be under 
selection, Sij > Sf^'^^^, in greater than two seasons. 



Region Amino acid positions 



Epitope A 122 124 126 131 133 137 140 142 144 145 

Epitope B 128 155 156 157 158 159 189 192 193 197 

Epitope C 45 50 273 275 278 312 

Epitope D 121 172 173 201 207 219 226 227 229 246 

Epitope E 57 62 75 78 83 92 260 262 

Out of epitopes 3 5 25 33 49 106 202 222 225 271 



Table 3: Annual selection, fraction of positions under selection, and diversity in epitopes A to E, 
positions not in any of the epitopes, and the whole HAl sequence. 



Region Selection Fraction of positions Diversity 

under selection 



Epitope A 


0.187 


0.152 


0.077 


Epitope B 


0.157 


0.134 


0.062 


Epitope C 


0.077 


0.087 


0.048 


Epitope D 


0.100 


0.072 


0.037 


Epitope E 


0.111 


0.094 


0.049 


Out of epitopes 


0.021 


0.019 


0.013 


The whole sequence 


0.060 


0.051 


0.028 



Figure [TJ Using the entropy method, we will show that epitopes A and B are under the highest 
average selection over all the seasons. These results can be interpreted as the antibody binds mostly 
to the top exposed part of the structure of hemagglutinin trimcr defined by epitopes A and B, and 
so the selection in these two epitopes is with higher intensity. 

We divide the HAl domain of the H3N2 hemagglutinin into six regions, namely epitopes A to 
E, and positions not in any of the epitopes. These regions show significantly distinct patterns of 
evolution. In each seasons from 1993-1994 to 2009-2010, we averaged selection and diversity in 
each epitope and the positions not in any of the epitopes. The fraction of positions j under selection 
defined by Sij > S^^^'^^ was also calculated. The averages for 17 seasons are listed in Table [31 It is 
evident that the values in Table [3] vary across the epitopes. The selection and diversity in epitopes 
A and B are greater than those in epitopes C, D, and E for each of selection, fraction of positions 
under selection, and diversity. The fraction of positions under selection is significantly greater 
than those in epitopes C, D, and E (p < 0.038, using Wilcoxon signed rank test). The values in 
epitopes C, D, and E are significantly greater than those not in any of the epitopes {p < 0.0019 for 
selection, p < 0.0011 for fraction of positions under selection, and p < 6.0 x 10^* for diversity, using 
Wilcoxon signed rank test). Consequently, epitopes A and B display the highest level of selection 
and diversity. 
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4 Discussion 



The Shannon entropy and the relative entropy are here introduced to quantify the diversity and 
the selection pressure of the evolving H3N2 virus. The foundation of the entropy calculation is the 
assumption that the virus sequences used in the entropy calculation are from a random unbiased 
sampling of the virus circulating in the human population. However, sampling density of the H3N2 
virus varies in different geographic regions, hence creating a sampling bias. 

We have addressed this issue at the continent level by analyzing data from different regions 
separately. We now additionally study the effect of sampling bias within one country. For example, 
among the H3 hemagglutinin sequences labeled with month of collection in the NCBI Influenza 
Virus Resource Database, the New York state sequences account for about one third of the USA 
sequences. In the contrast. New York state has only about 6.5% of the USA population. We chose 
eight seasons from 2001-2002 to 2008-2009 with abundant USA sequences collected in and out of 
New York state during this period of time available in the database. Using the procedure in the 
Materials and Methods section, we calculated the histogram of 20 amino acids in each amino acid 
position in each season for the New York state sequences, and that for the non-New York state 
sequences. Each of the 329 x 8 = 2632 pairs of histograms in and out of New York state were 
compared using the test for homogeneity. The p values of 2581 pairs are greater than 0.05. That 
is, 98.1% of the pairs are not significantly different. The high sampling density in New York state 
does not affect the histograms of 20 amino acids in each position, implying that the sampling bias 
of the II3N2 virus is uncorrelated to the amino acid usage patterns. 

By applying Shannon entropy and relative entropy to the aligned hemagglutinin sequences 
labeled with month of collection, we obtain the evolution and migration pattern of the H3N2 
virus in the Results section. First, Shannon entropy and relative entropy quantify diversity of and 
selection pressure over the virus, relatively. Relative entropy from the current season i to the next 
season i + 1 linearly increased with the Shannon entropy in the current season i. See Figure [2] and [3l 
Second, relative entropy quantifies the similarity of two groups of virus and implies the migration 
path of the H3N2 virus. See Table [TJ In the following text we compare our methods and results to 
the literature. 

The relative entropy reveals the II3N2 migration pattern. Previous studies appli ed phylogenetic 



metho ds in an attempt to locate the epicenter of the H3N2 epidemic in each season. iRambaut et 



42008') studied the dynamic of influenza sequence diversity in the temperate regions in both hemi- 
spheres to imply that the II3N2 virus orig inates in the tropics and migrates to the temperate regions 
in both hemispheres. iRussell et al. ( 2008[ ) obtained the antigenic and genetic evolution rate in each 
region and distances of the II3N2 strains to the trunk of the phylogenic tree. This information 
indicated East and Southeast Asia as the epic enter, from which t he II3N2 v irus spreads to North 
America, Europe, and Oceania in each season (jRussell et "Zl l2008h . Recently, [Bedford et oil (|2nifl[ ) 
suggested the center of the H3N2 migration network being China, Southeast Asia, and the USA 
by estimating the migration rate between different regions in the world. Here the relative entropy, 
the gain of sequence information, is used as a novel measure of the sequence similarity. The H3N2 
migration path is the directed graph in which each path has the minimum relative entropy, or 
the maximum sequence similarity. These studies reach a consensus that South China is located in 
the epicenter of infiuenza epidemics. Here, we additionally identify a novel migration path from 
the USA to Europe and show that virus evolutionary rate is higher in the epicenter than in the 
migration paths. 

Previous studies have identified positions that have led to the immune escape of infiuenza, by 
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resolving historical mutations. A 1997 study examining 254 H3 nucleotide sequences f rom 1984 to 



1996 identified 14 positions that are under selection, using the dN/dS ratio method (|Fitch et 



19971 ). In a 1999 study involving 357 nucleotide sequences f rom 1983 to 199 7, 18 positions were 



identified to be under selection, using the dN/dS ratio method ( Bush e^"allll999l ). A 2003 paper used 
the alignment of 525 nucleotide sequences froin 1968 to 2000 to calculate the codon diversity and the 
ammo acid diversity (|Plotkin fc Dushofj|2003l) . This paper reported 25 positions with the largest 
codon diversity and amino acid diversity. A 2007 paper located 63 positions of positive selection 
by ali gnment of 2,248 sequences from 1968 to 2005 and considering substitutions at the amino acid 
level ( Shih et al. 1[2003). Our study identified 54 positions under selection at the amino acid level, 
using 4,292 aligned sequences from the 1992-1993 season to the 2009-2010 season. Considering 
this historical body of work, it is apparent that the number of amino acid positions identified to 
be under selection has increased with the number and the time span of sequences used in the 
study and as the discriminating power of the data has increased. In addition, di fferent criteria to 
identify the posit i ons under selection have been introduced in the previous studies (IBush et aLlll999l: 



Fitch aLlll997l : IPlotkin fc Dushofj l2003t IShih et aZ.I [20071 ) and in the present studv. IShih et al. 
(l2007l) identified positions to be under selection when an amino acid substitution occurred during 
successive years. We here classify a position j as under selection if its relative entropy Si_j is greater 
than the threshold 5*'^''°*' in greater than two seasons. These two methods identify many identical 
positions as well as some distinct positions. 

The heterogeneity of the methods also contributes to identification of different sets of positions 
under selection. We note that these method s fall into two categories. The first category operates at 
the codon level. The diV/d5' ratio method ( Bush et al. 1999t Fitch et al. 1997) calculates the non- 
synonymous and synonymous mutation rate of the codon. The Plotkin fc Dusho^ (|2003l ) method 
comparing the codon diversity and the amino acid diversity is a variation of the dN/dS ratio 
method. The second category operates at the amino acid level. Shih et al. ( 2007t ) identified the 
positions with amino acid switch occurring in history to be under selection. Our entropy method 
recognizes the positions with relative entropy higher than the threshold, S}^'^'^^ , in greater than two 
seasons. A large dN/AS of a codon does not necessarily mean an amino acid switch in the same 
position because the amino a cid substitution could be unfixed. Methods at the amino acid level, 
such as the amino acid switch ( Shih et a/.ll2007h and our entropy method, can identify positions with 
low dA'"/dS' to be under selection because these methods do not consider nucleotide substitutions. 
Positive selection does not necessarily lead to a fixed amino acid switch, and in thi s case the entrop y 



method can still detect positive selection. Unlike the amino acid switch method (jShih et all 120071 ) 



the entropy method applied in this study is able to detect unfixed amino acid substitutions arising 
from selection. O ur entropy method releases the requirement of fixed amino acid substitution in 
(jShih et aLl 120071 ) but adds one requirement: the positions under selection need to present large 
relative entropy in greater than two seasons. Consequently, these methods identify slightly different 
sets of positions to be under selection. 



5 Conclusion 

We use Shannon entropy and relative entropy as two state variables of H3N2 evolution. The entropy 
method is able to predict H3N2 evolution and migration in the next season. First, we show that 
the rate of evolution increases with the virus diversity in the current season. The Shannon entropy 
data in one season strongly correlate with the relative entropy data from that season to the next 
season. If higher Shannon entropy of the virus is observed in one season, higher virus evolutionary 
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rate is expected from this season to the next season. Second, the relative entropy values between 
virus sequences from China, Japan, the USA, and Europe indicate that the H3N2 virus migration 
from China to Japan and the USA, and identify a novel migration path from the USA and Europe. 
The relative entropy values in and out of China, the epicenter, show that evolutionary rate is higher 
in China than in the migration paths. Moreover, the entropy method was demonstrated on two 
applications. First, selection pressure of the H3 hemagglutinin is mainly in 54 amino acid positions. 
Second, the top exposed part in the three-dimensional structure of HA trimer covered by epitopes 
A and B is under the highest level of selection. These results substantiate current thinking on 
H3N2 evolution, and show that the selection pressure is focused in a subset of amino acid positions 
in the epitopes, with epitopes A and B on the top of hemagglutinin being dominant and making 
the largest contribution to the H3N2 evolution. These predictions and applications show that the 
entropy method is not only predictive but also descriptive. 
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